Telco Assessment

Jaime Paz

2022-02-22

Problem Assessment

Our assessment is based on geolocation information for a set of 1,752 wireless subscribers of a phone company, spread across various cells in Nova Scotia. The historical data was provided as of September 1st, 2020. We have also been provided a dataset containing information on the Poseidon CTC Mall and its geofence cell list.

Using the data science toolbox, we need to understand how much time, on average, subscribers spend at the Poseidon CTC Mall. The goal of the project is to produce an executive summary and generate business insights that will help the marketing team achieve this objective.

Business Problem

Recently, there have been several discussions within our marketing team about developing a strategy to approach our clients more efficiently, according to the time they spend at the Poseidon CTC Mall. As a team, we need to determine the best time frame in which to target those clients by using the average dwell time, a metric that measures the amount of time a device uses a particular channel.

Our current point of interest (POI) for this strategy is the Poseidon CTC Mall.

Business Value

By developing a strategy to target our clients efficiently, we create organic growth for the business through increased sales revenue. By putting our final data product into production, we will get to know our clients better and improve our Voice of Customer.

Tools used

The tools used for this project are:

Python Programming: Python provides a vast set of utilities to develop data science and machine learning solutions through its most prominent packages: Pandas, Scikit-Learn, NumPy and others.

R - Reticulate: Reticulate is a tool to run both Python and R in the same environment.

R Markdown: R Programming (through RStudio) provides a set of tools to publish our work as documents and HTML sources, providing better presentations for our executives or audience.

ASSESSMENT PART 1

1. Loading packages

First we load the reticulate package so we can write Python and R code in the same Markdown environment:

# Use only once:

#Sys.setenv(RETICULATE_PYTHON = "/usr/local/bin/python3.7")
#library(reticulate)
#use_python('/usr/bin/python3', require = T)

Second, we import the Python packages:

import pandas as pd              # package for data wrangling and transformation
import numpy as np               # package for linear algebra
import seaborn as sns            # advanced graphical interface package 
import matplotlib.pyplot as plt  # basic graphical interface package
import copy
import warnings
warnings.filterwarnings('ignore')
from sklearn.cluster import KMeans

2. Collecting & Preparing Data

The first step is to load the data, which has been stored in CSV format. In addition to the files already provided, I found supporting information at https://mcc-mnc-list.com/list that helps our analysis.

With the cgi dataset I was able to get information about:

MCC: Mobile Country Code
MNC: Mobile Network Code
LAC: Location Area Code
CI: Cell Identity
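As a quick illustration, each `cgi_id` string in the geolocation data can be decomposed into these components. The helper below is a minimal sketch (the function name `parse_cgi` is our own, not part of the provided datasets); the sample values are taken from the `geo` preview shown further down:

```python
def parse_cgi(cgi_id):
    """Split a cgi_id of the form '<broadband>:<MCC>-<MNC>-<LAC>-<CI>'.

    Entries without a broadband prefix (no ':') are labelled 'Other',
    mirroring the treatment applied later in the feature-engineering step.
    """
    if ":" in cgi_id:
        broadband, rest = cgi_id.split(":", 1)
    else:
        broadband, rest = "Other", cgi_id
    mcc, mnc, lac, ci = rest.split("-")
    return broadband, mcc, mnc, lac, ci

print(parse_cgi("4G:302-220-113919-131"))  # ('4G', '302', '220', '113919', '131')
print(parse_cgi("302-220-11201-23458"))    # ('Other', '302', '220', '11201', '23458')
```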

# geo location data 
geo = pd.read_csv('/home/analytics/R/Projects/Python/datasets/telco/geolocation_data.csv')

# Poseidon CTC Mall Geofence data
poseidon = pd.read_csv('/home/analytics/R/Projects/Python/datasets/telco/Poseidon CTC Mall Geofence Cell List.csv')

# cgi data
cgi = pd.read_csv('/home/analytics/R/Projects/Python/datasets/telco/cgi.csv')

Checking shape and data structure

geolocation data

print(geo.shape)
## (408917, 3)
geo.head()
##                                              imsi_id  ...                 cgi_id
## 0  778fb33d64b6dcbc67c79ca6d35f7820f993f26c74502c...  ...  4G:302-220-113919-131
## 1  753c0c75ac582307c3b0d04bbe4cfe7d118684d1b4060e...  ...   4G:302-220-113155-41
## 2  05f8982252e683f1fed1f3ebd42f60848587a4e270ddcc...  ...   4G:302-220-113937-10
## 3  d5ce6f7f332850a23ae4700231c87dec634a1e7832304d...  ...    302-220-11201-23458
## 4  05f8982252e683f1fed1f3ebd42f60848587a4e270ddcc...  ...   4G:302-220-113937-10
## 
## [5 rows x 3 columns]

poseidon data

print(poseidon.shape)
## (105, 1)
poseidon.head()
##   Poseidon CTC Mall Geofence Cell List
## 0                4G:302-220-113122-132
## 1                 4G:302-220-113122-42
## 2                4G:302-220-113122-112
## 3                  4G:302-220-113122-2
## 4                4G:302-220-113122-131
poseidon['Poseidon CTC Mall Geofence Cell List'].nunique()
## 75

We have 75 unique cells (the cells that fence the POI “Poseidon CTC Mall”) that we are interested in. So we drop the duplicate values to have the data ready to be merged:

poseidon.drop_duplicates(subset=['Poseidon CTC Mall Geofence Cell List'], inplace = True)
poseidon.shape # indeed we have 75 cells
## (75, 1)
poseidon.head() 
##   Poseidon CTC Mall Geofence Cell List
## 0                4G:302-220-113122-132
## 1                 4G:302-220-113122-42
## 2                4G:302-220-113122-112
## 3                  4G:302-220-113122-2
## 4                4G:302-220-113122-131

CGI (Cell Global Identity)

cgi.shape
## (3, 6)
cgi.head()
##    MCC  ...                         Operator
## 0  302  ...                   Telus Mobility
## 1  302  ...                    Bell Mobility
## 2  302  ...  Shared Telus, Bell, and SaskTel
## 
## [3 rows x 6 columns]

Converting data types:

cgi.MCC = cgi.MCC.astype('object')
cgi.MNC = cgi.MNC.astype('object')

3. Merging the data


df = pd.merge(poseidon, geo, left_on = 'Poseidon CTC Mall Geofence Cell List', right_on = 'cgi_id')
df.head()
##   Poseidon CTC Mall Geofence Cell List  ...                 cgi_id
## 0                4G:302-220-113122-132  ...  4G:302-220-113122-132
## 1                4G:302-220-113122-132  ...  4G:302-220-113122-132
## 2                4G:302-220-113122-132  ...  4G:302-220-113122-132
## 3                4G:302-220-113122-132  ...  4G:302-220-113122-132
## 4                 4G:302-220-113122-42  ...   4G:302-220-113122-42
## 
## [5 rows x 4 columns]
df.shape
## (432, 4)
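Note that `pd.merge` defaults to an inner join, which is what restricts the 408,917 geolocation events to the 432 visits inside the geofence. A minimal sketch with tiny synthetic frames (standing in for `poseidon` and `geo`, which are not reproduced here) illustrates the semantics:

```python
import pandas as pd

# Tiny synthetic frames for illustration only
poseidon_demo = pd.DataFrame(
    {"Poseidon CTC Mall Geofence Cell List": ["4G:302-220-113122-132",
                                              "4G:302-220-113122-42"]})
geo_demo = pd.DataFrame(
    {"imsi_id": ["a", "b", "c"],
     "cgi_id": ["4G:302-220-113122-132",   # in-fence
                "4G:302-220-999999-1",     # outside the geofence
                "4G:302-220-113122-132"]}) # in-fence

# Inner join (the default): events outside the geofence are dropped
merged = pd.merge(poseidon_demo, geo_demo,
                  left_on="Poseidon CTC Mall Geofence Cell List",
                  right_on="cgi_id")
print(merged.shape)  # (2, 3) -- only the two in-fence visits survive
```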

Number of unique clients to be analyzed: 33

df.imsi_id.nunique()
## 33

4. Data verification / cleaning


telco_nulls = df.isna().sum()[df.isna().sum()!= 0].reset_index().rename(columns=
                                                                        {'index': 'col_name', 0: '#'})
telco_nulls['%'] = telco_nulls['#'] / df.shape[0]
telco_nulls
## Empty DataFrame
## Columns: [col_name, #, %]
## Index: []

No NULL values were found in the data.

5. Data Transformation & Feature Engineering

Extract Information from CGI: MCC - MNC - LAC - CI

new = df["cgi_id"].str.split("-", n = 4, expand = True)
broad_MCC = pd.DataFrame(new[0])
# If broadband does not contain 4G then place "other"
broad_MCC[1] = broad_MCC[0].map(lambda x: "4G" if "4G" in x else 'Other')

Because MCC is always a 3-digit code, we extract the last 3 characters:

broad_MCC[2] = broad_MCC[0].str[-3:]
broad_MCC[2].nunique() #indeed we have only 1 value.
## 1
broad_MCC.head()
##         0   1    2
## 0  4G:302  4G  302
## 1  4G:302  4G  302
## 2  4G:302  4G  302
## 3  4G:302  4G  302
## 4  4G:302  4G  302
df['broadband'] = broad_MCC[1]
df['MCC'] = broad_MCC[2]
df['MNC'] =  new[1]
df['LAC'] =  new[2]
df['CI'] =  new[3]
df.head()
##   Poseidon CTC Mall Geofence Cell List  ...   CI
## 0                4G:302-220-113122-132  ...  132
## 1                4G:302-220-113122-132  ...  132
## 2                4G:302-220-113122-132  ...  132
## 3                4G:302-220-113122-132  ...  132
## 4                 4G:302-220-113122-42  ...   42
## 
## [5 rows x 9 columns]

Dropping unnecessary columns

df.drop(columns=['Poseidon CTC Mall Geofence Cell List', 'cgi_id'], inplace = True)
df.head()
##                                              imsi_id  ...   CI
## 0  b58217bd1fe520ae4cb6a2228ea1cffa68590965cc5ba8...  ...  132
## 1  2f65da3d1f0ed00d831a2037868732a262523d6231e9bf...  ...  132
## 2  450c1b07f639674e3e004251fb86ce273ac21bc1c1754e...  ...  132
## 3  2f627a032a70455a80760c51c0c8cc9282c87f52a25703...  ...  132
## 4  044aa99b8a013c64e3924cf76dec7c9a25d7315bcaccaf...  ...   42
## 
## [5 rows x 7 columns]

Converting data types to be able to merge the data:

df.MCC = df.MCC.astype('str')
df.MNC = df.MNC.astype('str')
cgi.MCC = cgi.MCC.astype('str')
cgi.MNC = cgi.MNC.astype('str')
telco = pd.merge(df, cgi, how = 'left', left_on = ['MCC', 'MNC'], right_on = ['MCC', 'MNC' ])
telco.head()
##                                              imsi_id  ...        Operator
## 0  b58217bd1fe520ae4cb6a2228ea1cffa68590965cc5ba8...  ...  Telus Mobility
## 1  2f65da3d1f0ed00d831a2037868732a262523d6231e9bf...  ...  Telus Mobility
## 2  450c1b07f639674e3e004251fb86ce273ac21bc1c1754e...  ...  Telus Mobility
## 3  2f627a032a70455a80760c51c0c8cc9282c87f52a25703...  ...  Telus Mobility
## 4  044aa99b8a013c64e3924cf76dec7c9a25d7315bcaccaf...  ...  Telus Mobility
## 
## [5 rows x 11 columns]
telco.shape
## (432, 11)

As we can see, we have merged the data correctly. Some of the columns shown previously were included to illustrate the type of information we have at hand. For instance, Country and Brand: Country names the country being analyzed, and Brand largely duplicates Operator. ISO also simply contains “CA”.

From this point on, we are going to drop these 3 columns:

telco.drop(columns=['ISO', 'Country', 'Brand'], inplace = True)
telco.head()
##                                              imsi_id  ...        Operator
## 0  b58217bd1fe520ae4cb6a2228ea1cffa68590965cc5ba8...  ...  Telus Mobility
## 1  2f65da3d1f0ed00d831a2037868732a262523d6231e9bf...  ...  Telus Mobility
## 2  450c1b07f639674e3e004251fb86ce273ac21bc1c1754e...  ...  Telus Mobility
## 3  2f627a032a70455a80760c51c0c8cc9282c87f52a25703...  ...  Telus Mobility
## 4  044aa99b8a013c64e3924cf76dec7c9a25d7315bcaccaf...  ...  Telus Mobility
## 
## [5 rows x 8 columns]

6. Exploratory Data Analysis

Our data exploration consists of analyzing our categorical and numerical variables. Next, we will focus on the time-based analysis.

6.1. Client Analysis


telco.imsi_id.nunique()
## 33
telco.shape[0]
## 432

Insight 1: On September 1st, 2020, we had a total of 432 visits covered by our current POI (the Poseidon CTC Mall, Canada). In total, the POI served 33 unique clients.

6.2. Broadband Analysis

sns.countplot(telco.broadband)

round(100 * telco.groupby(['broadband']).size() / telco.shape[0])
## broadband
## 4G       34.0
## Other    66.0
## dtype: float64

Insight 2: Our POI saw 34% of demand on 4G broadband and 66% on the rest. This 66% was catalogued as “Other” since the dataset gave no specification of the broadband used at these particular points.

6.3. MCC / MNC / LAC / CI Analysis

telco.groupby(['MCC']).size()
## MCC
## 302    432
## dtype: int64
telco.groupby(['MNC']).size()
## MNC
## 220    431
## 880      1
## dtype: int64
sns.countplot(telco.LAC)
plt.show()

LAC = pd.DataFrame()
LAC['n'] = telco.groupby(['LAC']).size()
LAC['%'] = round(100 * (LAC['n'] / telco.shape[0]))
LAC
##           n     %
## LAC              
## 11204   285  66.0
## 113122  104  24.0
## 123353   43  10.0
CI = pd.DataFrame()
CI['n'] = telco.groupby(['CI']).size().sort_values(ascending = False)
CI['%'] = round(100 * (CI['n'] / telco.shape[0]))
CI
##          n     %
## CI              
## 24122  111  26.0
## 24353   79  18.0
## 21122   50  12.0
## 41      48  11.0
## 25122   43  10.0
## 111     37   9.0
## 131     23   5.0
## 1       18   4.0
## 42       5   1.0
## 132      4   1.0
## 2        4   1.0
## 112      3   1.0
## 71       3   1.0
## 22122    2   0.0
## 11       1   0.0
## 22       1   0.0

Insight 3: On September 1st, 2020, our POI registered two types of MNC (Mobile Network Code): 220 (431 times) and 880 (1 time). We were able to find information about these codes and extract the operator names (source: https://mcc-mnc-list.com/list): “Telus Mobility” (220) and “Shared Telus, Bell, and SaskTel” (880).

Insight 4: In addition to the previous insight, our group of cell towers is divided into 3 LAC categories: 11204, 113122 and 123353. There was high demand on LAC 11204 (285 times, about 66%); this is the most critical group so far. Our POI registered a total of 16 cells serving the Poseidon CTC Mall, and the most demanded CIs (Cell IDs) were: 24122 (26%), 24353 (18%) and 21122 (12%). The rest account for around 44% of the visits.

6.4 Time-Based Data (event_ts)

Formatting the time-based data (converting date-times to fractional hours)

telco.event_ts = telco.event_ts.astype('datetime64[ns]')
telco = telco.sort_values(by = ['event_ts'], ascending = True)
time = pd.DatetimeIndex(telco.event_ts)  # converts to datetime object index
telco['hours'] = (time.hour  + time.minute / 60 + time.second / 3600 )
telcov2 = copy.deepcopy(telco)
telco.head(10)
##                                                imsi_id  ...     hours
## 61   9f6dbb4957e34ad65b4061e1ac326afa539752db6a1f57...  ...  0.000000
## 52   9f6dbb4957e34ad65b4061e1ac326afa539752db6a1f57...  ...  0.000000
## 70   9f6dbb4957e34ad65b4061e1ac326afa539752db6a1f57...  ...  0.000000
## 149  9dd48ceccaea12ad1ed02afa45f096bba2409f1f5e04d5...  ...  0.550000
## 148  9dd48ceccaea12ad1ed02afa45f096bba2409f1f5e04d5...  ...  0.550000
## 165  9dd48ceccaea12ad1ed02afa45f096bba2409f1f5e04d5...  ...  0.566667
## 166  9dd48ceccaea12ad1ed02afa45f096bba2409f1f5e04d5...  ...  0.566667
## 171  e41a1559c092dfa12acf39072821ce575665fdc4c2a46c...  ...  0.616667
## 186  d6a5b12c28dc324c1ff61b2540336c0fbe3ac89b1744be...  ...  0.800000
## 155  d6a5b12c28dc324c1ff61b2540336c0fbe3ac89b1744be...  ...  0.800000
## 
## [10 rows x 9 columns]
output = telco.groupby('imsi_id').agg(
    dwell_time = ('hours', 'mean' ),
    r_visits = ('hours', 'count')
     ).sort_values(by = ['dwell_time'], ascending = True).reset_index()

Next, we present the list of our 33 clients according to their dwell_time and the number of recurrent visits by each one:

pd.set_option('display.max_rows', 50)
output
##                                               imsi_id  dwell_time  r_visits
## 0   9dd48ceccaea12ad1ed02afa45f096bba2409f1f5e04d5...    0.558333         4
## 1   d6a5b12c28dc324c1ff61b2540336c0fbe3ac89b1744be...    0.800000         2
## 2   9366f7e222a3682413fc4b3480289bf5ee8bee9f5c9354...    1.316667         1
## 3   f48b2e8df1f9f912c788b57e6f2c3246e08b74440f033e...    1.966667         6
## 4   f70cd60e8a80eebd2422c0db3d8fb5d9b441bb6ef22c71...   10.123333         5
## 5   e41a1559c092dfa12acf39072821ce575665fdc4c2a46c...   10.822917        16
## 6   10ef0c74bc117442d7af58334e1a7f6d7e6e374a705e1f...   11.524242        11
## 7   9f6dbb4957e34ad65b4061e1ac326afa539752db6a1f57...   12.148148         9
## 8   c6f43f7ae49f220a883f6b63b26024eff38ce3c2e773a6...   13.783965       132
## 9   e2ee0f057e15ceedbb10b6c7ba335487fd20cfb7b1ca55...   13.992857         7
## 10  12f577a9e7c23830cfc662a8954126c96a2f966c248457...   14.255556        12
## 11  d993bd44520ea984f45182b7398c98a2af4e0da09ef423...   14.733333         1
## 12  321b7cf31816a63fb196a208c3eeea20de04cd59dc35d8...   15.870000         5
## 13  044aa99b8a013c64e3924cf76dec7c9a25d7315bcaccaf...   15.920833         4
## 14  cbc12ecb1aa2a955a6426a3114f1a3230ff618560be98a...   16.233333         5
## 15  2112b7c11b6fd3e8b5c78af39566e4834685bdcaf56c6b...   16.600000         4
## 16  cfef7cf475779a432e73e64641bc5d2abb58e5eebc15ce...   17.488889         3
## 17  7948cc9c2c4ede47301ef73252a41fdf9b8e07817f5604...   17.503604        37
## 18  2f65da3d1f0ed00d831a2037868732a262523d6231e9bf...   17.633333         2
## 19  16d323326378966732abaafddad4027288389ee110cbd1...   17.683333         3
## 20  f85657aee6b4977e8909ea8de56f26d3d7010a40e5d210...   17.773232        33
## 21  f043c2b9f621827a588b37bf25f7a28f41ce3ce5928dbb...   18.033333         1
## 22  b58217bd1fe520ae4cb6a2228ea1cffa68590965cc5ba8...   18.561538        13
## 23  05b5aa3eb6db1abda56d8cd0f15e6f5ce2307d16cc9e05...   19.327778         3
## 24  9b03a42cf240abbb766ddbe3e5091bfc1cc33c0bf08409...   19.576316        19
## 25  2f627a032a70455a80760c51c0c8cc9282c87f52a25703...   19.905556        18
## 26  2acb2650d37de9efe5ef71d02c0f9ac8dba554f6f70370...   19.916667        12
## 27  16a2a5883d4b3b6f5d4f688e43887bbf712769c612b9da...   19.983333         1
## 28  1581807a0d2e3c44b53791d8601ae83e0e1ac51142d84d...   21.066667         2
## 29  bc9e5df525777dbe77e262794de72923365aa37b3c019f...   21.129902        34
## 30  85d444651f5123ca9e97d30ff30a40f0a65d0487ac37df...   21.250000         4
## 31  450c1b07f639674e3e004251fb86ce273ac21bc1c1754e...   23.471111        15
## 32  059e71d6579dadcfa0cef78ee779b671ac254549a2b4bb...   23.843750         8

Insight 5: Our greatest potential client registered on September 1st, 2020 was:

output[output.r_visits == output.r_visits.max()]
##                                              imsi_id  dwell_time  r_visits
## 8  c6f43f7ae49f220a883f6b63b26024eff38ce3c2e773a6...   13.783965       132

In addition, we can target this client during business hours, from 1 PM to 2 PM.
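That window comes from converting the client's average fractional hour (13.783965) back to clock time. A small helper (a hypothetical name of ours, not part of the analysis code) makes the conversion explicit:

```python
def hours_to_clock(h):
    """Convert a fractional hour-of-day (e.g. 13.78) to an 'HH:MM' string."""
    hh = int(h)
    mm = int(round((h - hh) * 60))
    if mm == 60:           # guard against rounding up to a full hour
        hh, mm = hh + 1, 0
    return f"{hh:02d}:{mm:02d}"

print(hours_to_clock(13.783965))  # 13:47 -- i.e. the 1 PM - 2 PM window
```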

Insight 6: The most recurrent clients registered at our Poseidon CTC Mall POI were:

output.nlargest(3, 'r_visits')
##                                               imsi_id  dwell_time  r_visits
## 8   c6f43f7ae49f220a883f6b63b26024eff38ce3c2e773a6...   13.783965       132
## 17  7948cc9c2c4ede47301ef73252a41fdf9b8e07817f5604...   17.503604        37
## 29  bc9e5df525777dbe77e262794de72923365aa37b3c019f...   21.129902        34

7. DWELL TIME AVERAGE ANALYSIS

Displaying the distribution of dwell time by the number of recurrent visits:

output_bin = copy.deepcopy(output)

output_bin['points_bin'] = pd.qcut(output_bin['dwell_time'], q=8)

# binning the data
output_bin = output_bin.groupby('points_bin').agg(
    r_visits = ('r_visits', 'sum')
     ).reset_index()

#plotting the data:     
plt.figure(figsize = (15, 5))
sns.barplot(x = output_bin.points_bin, y = output_bin.r_visits, color = 'blue')
# Labeling of the plot
plt.xlabel('dwell time (hrs)') 
plt.ylabel('Recurrent visits')
plt.title('Distribution of dwell time (Poseidon CTC Mall)')
plt.show()

Insight 6 (a): The busiest hours at the Poseidon CTC Mall on Sept 1st, 2020 occurred between 10 hrs and 13 hrs, with a number of recurrent visits > 160. This is the segment our marketing workforce should particularly focus on.

Displaying the relationship between dwell time and the number of recurrent visits:

plt.figure(figsize = (15, 5))
sns.scatterplot(x = output.dwell_time, y = output.r_visits)
# Labeling of the plot
plt.xlabel('dwell time (hrs)') 
plt.ylabel('Recurrent visits')
plt.title('Dwell_time vs Recurrent visits (Poseidon CTC Mall)')
plt.show()

output[output.r_visits == output.r_visits.max()]
##                                              imsi_id  dwell_time  r_visits
## 8  c6f43f7ae49f220a883f6b63b26024eff38ce3c2e773a6...   13.783965       132

Insight 6 (b): We can see a potential outlier corresponding to the client identified before. Apart from that, the number of recurrent visits stays below 40. There are interesting points during the very early mornings. Are these perhaps security staff? Or does the mall have a busy nightlife? If there is a cinema located at the mall, these could well be late schedules due to movie premieres.
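The outlier can also be flagged programmatically with the standard 1.5 × IQR rule; the sketch below uses illustrative visit counts (not the full 33-client table), with an extreme value mimicking the 132-visit client:

```python
import numpy as np

# Illustrative visit counts; 132 mimics the extreme client above
visits = np.array([1, 2, 3, 4, 5, 9, 12, 16, 19, 34, 37, 132])

q1, q3 = np.percentile(visits, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)      # classic Tukey upper fence
outliers = visits[visits > upper_fence]
print(outliers)  # [132]
```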

ts = copy.deepcopy(telcov2)
ts['cnt'] = 1
#ts = ts.groupby('event_ts').agg({"cnt": "sum"}).reset_index()
ts.event_ts = ts.event_ts.astype(str)

8. Client Trend Analysis

Loading R Packages

library(reticulate)
ts_data <- py$ts
# Forecasting Libraries ----
library(forecast)    
library(tidymodels)   
library(modeltime)    
library(tidyverse)    
library(lubridate)    
library(timetk)       
library(plotly)
#renaming colums

ts_data$Date <- ts_data$event_ts 

#converting to datetime
ts_data$Date <- as.POSIXct(as.character(ts_data$Date), format = "%Y-%d-%m %H:%M:%S")
ts_data <- as_tibble(ts_data)

Plotting the data

#summarize by hour

p1 <- ts_data %>%  group_by(Date) %>%
  summarize_by_time(.date_var=Date, .by = 'hour', no_visits = sum(cnt) ) %>%
  plot_time_series(Date, no_visits,  .smooth = FALSE,
                   .title = "Poseidon CTC Malls Visits Trend - Sep 1st 2020", .y_lab =  "count of client visits" ) 
p1

Insight 8: We can clearly spot an increasing trend in the number of visits at our Poseidon CTC Mall. By the end of the day, we have registered 54 unique clients.

Let’s understand the patterns of both of our broadband technologies:

#summarize by hour

p2 <- ts_data %>%  group_by(broadband) %>%
  summarize_by_time(.date_var=Date, .by = 'hour', no_visits = sum(cnt) ) %>%
  plot_time_series(Date, no_visits,  .smooth = TRUE,
                   .color_var = broadband ,
                   .title = "Poseidon CTC Broadband types - Sep 1st 2020", 
                   .y_lab =  "count of client visits", .facet_ncol = 2,
                   .facet_scales = "free" )  
p2

Insight 9: 4G broadband technology is not equally consumed in our POI. It would be interesting to know what these ‘Other’ broadband technologies are.

Let’s understand the patterns in each cell consumed by our LAC code:

#summarize by hour

p3 <- ts_data %>%  group_by(LAC) %>%
  summarize_by_time(.date_var=Date, .by = 'hour', no_visits = sum(cnt) ) %>%
  plot_time_series(Date, no_visits,  
                   .title = "Poseidon CTC Mall by LAC - Sep 1st 2020", 
                   .y_lab =  "count of client visits", .facet_ncol = 2,
                   .facet_scales = "free" ) 
p3

Insight 10: Local Area Code 11204 is highly demanded at several hours of the day. LAC 123353 doesn’t seem to show any particular movement except for the end of the day, especially at 7:00 PM.

Let’s see our top (most consumed) CIs:

#summarize by hour

p4 <- ts_data %>%  group_by(CI) %>%
  summarize_by_time(.date_var=Date, .by = 'hour', no_visits = sum(cnt) ) %>% arrange(desc(no_visits)) %>% 
  filter(CI %in% c('111', '24122', '131', '41')) %>%
  plot_time_series(Date, no_visits,  .smooth = FALSE,
                   .title = "Poseidon CTC Mall by CI - Sep 1st 2020", 
                   .y_lab =  "count of client visits", .facet_ncol = 1,
                   .facet_scales = "free" ) 
p4

Insight 11: Apparently, not all CIs are heavily used during the day. For instance, CI 111 only shows movement during the last hours of the night, and CI 131 shows fluctuations between 12:00 hrs and the end of the day.

ASSESSMENT PART 2

Once the average dwell time analysis is complete, cluster the subscribers seen inside the Poseidon CTC Mall into various segments, and explain the various segments. We would also like to know how you selected the number of segments (or K value), if at all.

1. Selecting the data

Storing the data for the columns: dwell_time and r_visits (by each distinct client):

output_2 = output
X = output_2.iloc[:, [1,2]].values  

2. Model Building

# By applying K-Means algorithm, we are going to find the best cluster number 
# The idea, is to reduce WCSS (Within - Cluster - Sum of Squares)
from sklearn.cluster import KMeans
wcss = []

# Getting the maximum number of clusters (to begin with, we set 10 by default)

# According to the official scikit-learn documentation:

# 'k-means++' selects initial cluster centers for k-means clustering in a smart way
# to speed up convergence.

# random_state = 123, to fix the randomness

# Inertia is the sum of squared distances of all the points within a cluster from the
# centroid of that cluster

# By default, we chose the 'auto' algorithm for converging. Euclidean distance is used
# in this case, but it might change according to our performance needs.

for i in range(1,11):
    kmeans = KMeans(n_clusters= i, max_iter = 300, 
                    init='k-means++', random_state=123, 
                    algorithm='auto')
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
## KMeans(n_clusters=1, random_state=123)
## KMeans(n_clusters=2, random_state=123)
## KMeans(n_clusters=3, random_state=123)
## KMeans(n_clusters=4, random_state=123)
## KMeans(n_clusters=5, random_state=123)
## KMeans(n_clusters=6, random_state=123)
## KMeans(n_clusters=7, random_state=123)
## KMeans(random_state=123)
## KMeans(n_clusters=9, random_state=123)
## KMeans(n_clusters=10, random_state=123)

3. Visualizing results


# Viewing ELBOW method to get the optimal value of K

plt.plot(range(1,11), wcss)
plt.title('Elbow Method')
plt.xlabel('No. of clusters')
plt.ylabel('wcss')
plt.show() 

From the graph above, and according to the “elbow” validation approach, choosing 6 clusters achieves a suitably low WCSS error.
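If we wanted to pick the elbow programmatically rather than by eye, one common heuristic (an addition of ours, not part of the original analysis) is to take the k where the WCSS curve bends most sharply, i.e. where its second difference is largest:

```python
import numpy as np

def elbow_k(wcss):
    """Pick k at the sharpest bend ('kink') of the WCSS curve.

    wcss[i] is assumed to correspond to k = i + 1 clusters.
    """
    curvature = np.diff(wcss, 2)          # second difference, length len(wcss) - 2
    return int(np.argmax(curvature)) + 2  # +2 maps the index back to a k value

# Illustrative WCSS values with a clear kink at k = 3
demo = [1000, 700, 150, 120, 100, 90, 85]
print(elbow_k(demo))  # 3
```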


#Model Build
kmeans_model = KMeans(n_clusters= 6, max_iter = 300, 
                    init='k-means++', random_state=123, 
                    verbose = 1, algorithm='auto')
output_2['y_pred'] = kmeans_model.fit_predict(X)
## Initialization complete
## Iteration 0, inertia 476.2443094109775
## Iteration 1, inertia 305.4512442082231
## Converged at iteration 1: strict convergence.
## Initialization complete
## Iteration 0, inertia 545.1293060336468
## Iteration 1, inertia 312.2361716096957
## Iteration 2, inertia 306.7580124449414
## Converged at iteration 2: strict convergence.
## Initialization complete
## Iteration 0, inertia 450.11941478483624
## Iteration 1, inertia 371.25146932287197
## Iteration 2, inertia 315.6318486261001
## Iteration 3, inertia 306.7580124449414
## Converged at iteration 3: strict convergence.
## Initialization complete
## Iteration 0, inertia 437.3813717106833
## Iteration 1, inertia 354.67313867714694
## Iteration 2, inertia 306.7580124449414
## Converged at iteration 2: strict convergence.
## Initialization complete
## Iteration 0, inertia 483.9692655293205
## Iteration 1, inertia 360.8337449127447
## Iteration 2, inertia 310.94913777733245
## Iteration 3, inertia 305.45124420822316
## Converged at iteration 3: strict convergence.
## Initialization complete
## Iteration 0, inertia 544.6388108996249
## Iteration 1, inertia 354.6731386771469
## Iteration 2, inertia 306.7580124449414
## Converged at iteration 2: strict convergence.
## Initialization complete
## Iteration 0, inertia 448.99713376178613
## Iteration 1, inertia 306.7580124449414
## Converged at iteration 1: strict convergence.
## Initialization complete
## Iteration 0, inertia 408.68816905742136
## Iteration 1, inertia 306.7580124449414
## Converged at iteration 1: strict convergence.
## Initialization complete
## Iteration 0, inertia 548.0569280821813
## Iteration 1, inertia 404.6820345169024
## Iteration 2, inertia 383.5558342152465
## Iteration 3, inertia 373.7482806253318
## Converged at iteration 3: strict convergence.
## Initialization complete
## Iteration 0, inertia 520.4707242606468
## Iteration 1, inertia 346.1610231213367
## Iteration 2, inertia 336.1378259792391
## Iteration 3, inertia 311.73039202918505
## Iteration 4, inertia 306.7580124449414
## Converged at iteration 4: strict convergence.

# plot results
plt.figure(figsize=(10,6))

plot_clusters = sns.scatterplot(x='dwell_time', y='r_visits', hue='y_pred',palette='Set1', s=100, alpha=0.2,  
                data=output_2).set_title('KMeans Clusters (6)', fontsize=15)


plt.show()

By looking at the previous graph, we are able to segment our customers using 6 clusters. It’s important to remark that most of our clients stay below 40 visits a day. One particular cluster (1) is highly skewed, at more than 120 visits. This shows that, from now on, any other potential client who behaves similarly will belong to this particular cluster.

FINAL WORDS AND RECOMMENDATIONS

(1) By analyzing the dwell time performance of our Poseidon CTC POI, we were able to understand patterns and relationships in our data that will help the marketing department know our clients and approach them better. We were able to segment the most prominent clients in our current portfolio. This will also allow the sales leaders to focus specifically on segments in which they can target new customers or make offers to current ones.

(2) By understanding our broadband / MNC / LAC breakdown, we were able to find frequency and cyclical patterns in our clients’ behavior. That will help the marketing leaders focus on particular times of the day.

(3) We used machine learning to cluster our current clients according to their dwell time and visit counts. If new clients arrive, our algorithm will be capable of assigning them to a cluster. It is highly recommended that we keep updating our models periodically, since the structure and behavior of the clients, as well as systematic events, could affect model performance.
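Assigning a cluster to a new client is a single call to the fitted model’s `predict` method. The sketch below is self-contained, fitting on synthetic (dwell_time, r_visits) pairs since the fitted `kmeans_model` from the analysis is not reproducible here:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (dwell_time, r_visits) pairs standing in for the 33-client table
rng = np.random.default_rng(123)
X_demo = np.column_stack([rng.uniform(0, 24, 60),    # dwell_time in hours
                          rng.integers(1, 40, 60)])  # recurrent visits

model = KMeans(n_clusters=6, init="k-means++", random_state=123, n_init=10)
model.fit(X_demo)

# Assign a cluster to a hypothetical new client: 14 h dwell time, 20 visits
new_client = np.array([[14.0, 20]])
print(model.predict(new_client))  # one cluster label in 0..5
```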

(4) The next steps are to put our machine learning model into production or embed the results into a dashboard. As per the time series analysis, it is also recommended to build a forecasting model capable of predicting the number of clients visiting our current Poseidon CTC POI.